R is a popular open-source programming language and environment for statistical computing and graphics. In this section, we will cover basic R syntax, functions, data structures, and data manipulation using the tidyverse.
# Arithmetic operations
2 + 3
## [1] 5
4 * 5
## [1] 20
6 / 2
## [1] 3
# Variables
x <- 10
y <- 20
x + y
## [1] 30
# Functions
mean(c(1, 2, 3, 4, 5))
## [1] 3
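Besides built-in functions such as mean(), you can define your own. The following is a minimal sketch; the helper name standard_error and the example values are assumptions made for illustration and are not used elsewhere in this workshop.

```r
# Hypothetical helper: standard error of the mean
standard_error <- function(values) {
  # sd() is the sample standard deviation; length() counts the observations
  sd(values) / sqrt(length(values))
}

measurements <- c(1, 2, 3, 4, 5)
se <- standard_error(measurements)
se
```

A function is created with function(), assigned to a name with <-, and returns the value of its last expression.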
R has several data structures for organizing and storing data. Understanding these data structures is essential for effective data manipulation and analysis.
v <- c(1, 2, 3, 4, 5)
m <- matrix(1:9, nrow = 3, ncol = 3)
print(m)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
df <- data.frame(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 30, 35),
city = c("New York", "San Francisco", "Los Angeles")
)
print(df)
## name age city
## 1 Alice 25 New York
## 2 Bob 30 San Francisco
## 3 Charlie 35 Los Angeles
l <- list(
name = "Alice",
age = 25,
city = "New York"
)
print(l)
## $name
## [1] "Alice"
##
## $age
## [1] 25
##
## $city
## [1] "New York"
Each data structure has its strengths and applications, and choosing the right one depends on the specific needs of your data manipulation and analysis tasks.
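As a rough illustration of how these structures differ in practice, the sketch below pulls elements out of each one. The objects mirror the examples above (with a shortened data frame); the result names are assumptions chosen for this example.

```r
v <- c(1, 2, 3, 4, 5)
m <- matrix(1:9, nrow = 3, ncol = 3)
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35)
)
l <- list(name = "Alice", age = 25)

second_value <- v[2]              # vector: single index
center_cell  <- m[2, 2]           # matrix: [row, column]
ages         <- df$age            # data frame: column by name
over_28      <- df[df$age > 28, ] # data frame: rows matching a condition
person_name  <- l[["name"]]       # list: element by name
```

Vectors and matrices hold a single type, data frames hold one type per column, and lists can mix anything, which is why each uses a slightly different way of indexing.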
# Install tidyverse and ggplot2 if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
install.packages("tidyverse")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
# Load the tidyverse and ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
R can read and write data in a wide variety of file formats, including CSV, TSV, and Excel files.
The tidyverse packages, such as readr and
readxl, provide functions for reading data from many of
these formats. In this workshop, we’ll focus on reading data from CSV,
TSV, or Excel files.
# Read data from a CSV file
data_csv <- read_csv("your_data_file.csv")
## New names:
## • `` -> `...1`
## Rows: 29 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): imdb, title, test, clean_test, binary, code, director, director_ge...
## dbl (11): ...1, year, budget, domgross, intgross, budget_2013$, domgross_201...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the data
head(data_csv)
## # A tibble: 6 × 22
## ...1 year imdb title test clean…¹ binary budget domgr…² intgr…³ code
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 0 2013 tt1711425 21 &a… nota… notalk FAIL 1.3 e7 2.57e7 4.22e7 2013…
## 2 1 2012 tt1343727 Dredd… ok-d… ok PASS 4.5 e7 1.34e7 4.09e7 2012…
## 3 2 2013 tt2024544 12 Ye… nota… notalk FAIL 2 e7 5.31e7 1.59e8 2013…
## 4 3 2013 tt1272878 2 Guns nota… notalk FAIL 6.1 e7 7.56e7 1.32e8 2013…
## 5 4 2013 tt0453562 42 men men FAIL 4 e7 9.50e7 9.50e7 2013…
## 6 5 2013 tt1335975 47 Ro… men men FAIL 2.25e8 3.84e7 1.46e8 2013…
## # … with 11 more variables: `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
## # `intgross_2013$` <dbl>, `period code` <dbl>, `decade code` <dbl>,
## # director <chr>, director_gender <chr>, genre <chr>, rating <dbl>,
## # country <chr>, language <chr>, and abbreviated variable names ¹clean_test,
## # ²domgross, ³intgross
Remove the leading # from the second and last lines to run the example.
# Read data from a TSV file
# data_tsv <- read_tsv("your_data_file.tsv")
# Display the first few rows of the data
# head(data_tsv)
# Install readxl package if not already installed
if (!requireNamespace("readxl", quietly = TRUE)) {
install.packages("readxl")
}
# Load the readxl package
library(readxl)
Remove the leading # from the second and last lines to run the example.
# Read data from an Excel file
# data_excel <- read_excel("your_data_file.xlsx", sheet = "Sheet1")
# Display the first few rows of the data
# head(data_excel)
Replace "your_data_file.csv",
"your_data_file.tsv", and
"your_data_file.xlsx" with the appropriate file paths for
your dataset. For the Excel file, also specify the correct sheet name
using the sheet parameter in the read_excel()
function.
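If you do not have a data file at hand, the round trip can be sketched with a temporary file. The file contents and column names below are made up for illustration; col_types is a readr argument that fixes the column types up front, which also suppresses the column specification message.

```r
library(readr)

# Write a tiny example file to a temporary location (hypothetical data)
example_path <- tempfile(fileext = ".csv")
writeLines(
  c("title,year,rating", "Film A,2010,7.1", "Film B,2012,6.4"),
  example_path
)

# Declare each column's type instead of letting readr guess
films <- read_csv(
  example_path,
  col_types = cols(
    title = col_character(),
    year = col_integer(),
    rating = col_double()
  )
)
films
```

Declaring types explicitly is optional, but it makes the import reproducible when a column could be guessed differently from one file to the next.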
Tidyverse pipes, written with the %>% operator
(keyboard shortcut Ctrl+Shift+M in RStudio), allow you to chain together multiple
functions in a clear and readable manner. Pipes take the output of one
function and use it as the input for the next function, making it easy
to follow the sequence of data transformations.
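To make the idea concrete, here is a small sketch comparing a nested call with the equivalent piped call. The vector values and the result names are assumptions for illustration only.

```r
library(dplyr)  # makes %>% available

values <- c(4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped call: reads from left to right, one step per line
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions compute the same result; the piped form simply lists the steps in the order they happen.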
In the following example, we’ll use pipes to perform a series of
operations on our dataset:
1. Group the data by a specific category (group_by)
2. Calculate summary statistics for each group (summarize)
3. Sort the resulting summary by a specific statistic (arrange)
# Summarize the data
data_summary <- data %>%
group_by(YourCategory) %>%
summarize(
mean_value = mean(YourVariable, na.rm = TRUE),
n = n()
) %>%
arrange(desc(mean_value))
head(data_summary)
## # A tibble: 5 × 3
## YourCategory mean_value n
## <fct> <dbl> <int>
## 1 B 50.7 25
## 2 A 50.0 20
## 3 C 49.6 11
## 4 E 49.1 24
## 5 D 46.5 20
In the above code, replace data with your own data frame (for example,
data_csv) and YourCategory and YourVariable with the appropriate column
names from your dataset. This will give a high-level overview of the data, summarizing
it by the specified category and calculating the mean of the specified
variable. The use of pipes makes it easy to understand the sequence of
transformations applied to the data.
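The same pattern can be tried without any external file. The sketch below uses the built-in mtcars dataset as a stand-in, with cyl playing the role of YourCategory and mpg the role of YourVariable; those substitutions are assumptions made for this example.

```r
library(dplyr)

cyl_summary <- mtcars %>%
  group_by(cyl) %>%                       # one group per cylinder count
  summarize(
    mean_mpg = mean(mpg, na.rm = TRUE),   # average fuel economy per group
    n = n()                               # number of cars per group
  ) %>%
  arrange(desc(mean_mpg))                 # highest average first
cyl_summary
```

The result has one row per group, which is usually the shape you want before plotting or reporting summary statistics.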
Use the cell below to repeat the steps we have done so far. Download
the data from here
(if you haven’t done so already), read it into R, and try tidyverse
functions such as summarize() and count(). You can find more options in the
data-wrangling-cheatsheet
btd <- read_tsv('bechdel_test - bechdel_test.tsv')
## New names:
## • `` -> `...1`
## Rows: 1794 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (11): imdb, title, test, clean_test, binary, code, director, director_ge...
## dbl (11): ...1, year, budget, domgross, intgross, budget_2013$, domgross_201...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The ggplot2 package in R follows a modular paradigm based on the “Grammar of Graphics.” This modular approach allows users to build complex plots by combining simple components or layers. Each layer represents a specific element of the plot, such as data, aesthetics, geoms, scales, and themes.
Data: This is the foundation of any plot. You specify the dataset you want to visualize.
Aesthetics: Aesthetics are the visual properties of the plot, such as x and y position, color, size, shape, and transparency. You map variables in the dataset to these aesthetics, creating a relationship between the data and the plot elements.
Geoms: Geoms (short for geometric objects) are the actual plot elements, such as points, lines, and bars. Different geoms represent different types of plots, like scatter plots, line plots, or bar plots. You can add multiple geoms to a single plot to create complex visualizations.
Scales: Scales control how data values are mapped to aesthetic properties. They define the transformation and mapping of data values to visual properties, such as colors, sizes, or shapes. You can adjust scales to customize the appearance of the plot.
Themes: Themes control the non-data aspects of the plot, such as the background, gridlines, axis labels, and legend. You can customize the plot’s appearance by changing its theme.
In the ggplot2 modular paradigm, you start by specifying the data and the aesthetics, then add geoms, scales, and themes as needed. This layer-by-layer approach allows you to create a wide range of plots by combining and customizing these components.
Here’s an example that demonstrates the modular paradigm:
library(ggplot2)
# Define data and aesthetics
plot <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = hp))
plot
# Add geom
plot <- plot + geom_point()
plot
# Adjust scale
plot <- plot + scale_color_continuous(low = "blue", high = "red")
plot
# Apply theme
plot <- plot + theme_minimal()
# Display the final plot
plot
In this example, we first define the data and aesthetics, then add a point geom, adjust the color scale, and finally apply a minimal theme. The result is a scatter plot that shows the relationship between car weight and miles per gallon, with points colored according to horsepower.
Aesthetics are the visual properties of the elements in a plot. They
help convey the underlying patterns and relationships in the data. In
ggplot2, you map variables from your dataset to aesthetics to create a
relationship between the data and the plot elements. Common aesthetics
include x and y position, color,
size, shape, group, and
transparency.
Let’s create a scatter plot with the year variable
mapped to the x-axis, the rating variable mapped to the
y-axis, and the genre variable (converted to a factor) mapped to the
color aesthetic:
# btd <- read_tsv('bechdel_test - bechdel_test.tsv')
# Create a scatter plot with aesthetics mapped to variables
scatter_plot <- ggplot(btd, aes(x = year, y = rating, color = factor(genre))) +
geom_point()
scatter_plot
## Warning: Removed 3 rows containing missing values (geom_point).
You can modify aesthetic properties directly within a geom. This allows you to make specific adjustments to the appearance of individual plot elements. Let’s create a scatter plot with larger points and a custom transparency:
# Modify the size and transparency of points within the geom
scatter_plot_independent_aes <- ggplot(btd, aes(x = year, y = rating)) +
geom_point(size = 3, alpha = 0.3, color='gold')
scatter_plot_independent_aes
## Warning: Removed 3 rows containing missing values (geom_point).
scatter_plot_data_related_aes <- ggplot(btd, aes(x = year, y = rating)) +
geom_point(size = 3, alpha = 0.3, aes(color = factor(genre)))
scatter_plot_data_related_aes
## Warning: Removed 3 rows containing missing values (geom_point).
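In this example, we increased the size of the points and made
them semi-transparent by setting the size and
alpha arguments within the geom_point()
function. In the first plot, color is also set to a constant value
('gold') outside aes(), so every point gets the same color. In the
second plot, color is mapped to genre inside aes() within the geom, so
it varies with the data. Aesthetics set or mapped within a geom
override the mappings inherited from ggplot() for that layer and give
you more control over the appearance of your plot elements.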
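The difference between setting and mapping an aesthetic can be checked directly. This sketch uses mtcars instead of the Bechdel data so that it runs on its own; the object names are assumptions, and layer_data() is the ggplot2 helper that returns the values computed for a layer.

```r
library(ggplot2)

# Constant colour: set outside aes(), identical for every point
p_constant <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "gold", size = 3, alpha = 0.3)

# Mapped colour: inside aes(), varies with the data
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)), size = 3, alpha = 0.3)

# Inspect the colours that will actually be drawn
constant_colours <- unique(layer_data(p_constant)$colour)
mapped_colours <- unique(layer_data(p_mapped)$colour)
```

The first plot ends up with a single colour and no legend, while the second gets one colour per cylinder count and a legend generated from the mapping.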
In this section, we will demonstrate various geom selections using built-in example datasets. Each geom represents a specific type of plot, and their applicability depends on the type of data (numeric vs. categorical) and the relationship you want to visualize.
A scatterplot displays the relationship between two numeric variables by plotting points at their respective x and y coordinates. It’s useful for visualizing trends, patterns, or outliers in the data.
# Use the mtcars dataset as an example
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Create a scatterplot of mpg (miles per gallon) vs. wt (weight)
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point()
scatter_plot
A line plot connects data points with lines to visualize the relationship between two numeric variables. It’s useful for showing trends over time or any continuous variable.
# Use the pressure dataset as an example
head(pressure)
## temperature pressure
## 1 0 0.0002
## 2 20 0.0012
## 3 40 0.0060
## 4 60 0.0300
## 5 80 0.0900
## 6 100 0.2700
# Create a line plot of temperature vs. pressure
line_plot <- ggplot(pressure, aes(x = temperature, y = pressure)) +
geom_line()
line_plot
A bar plot displays the frequency or count of categorical data, while a column plot displays the value of a numeric variable for each category. Both are useful for visualizing relationships between categorical and numeric variables.
# Use the diamonds dataset as an example
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# Create a bar plot of diamond counts by cut
bar_plot <- ggplot(diamonds, aes(x = cut)) +
geom_bar()
bar_plot
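# Create a column plot of average price by cut
# geom_col() stacks raw values, so first reduce the data to one mean per cut
avg_price_by_cut <- diamonds %>%
group_by(cut) %>%
summarize(mean_price = mean(price))
column_plot <- ggplot(avg_price_by_cut, aes(x = cut, y = mean_price)) +
geom_col()
column_plot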
A histogram groups numeric data into bins and displays the frequency of observations in each bin. It’s useful for visualizing the distribution of a numeric variable.
# Create a histogram of the mpg variable from the mtcars dataset
histogram_plot <- ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2)
histogram_plot
A box plot displays the distribution of a numeric variable across different categories. It’s useful for comparing distributions and identifying outliers within categorical groups.
# Create a box plot of price by cut for the diamonds dataset
box_plot <- ggplot(diamonds, aes(x = cut, y = price)) +
geom_boxplot()
box_plot
These are just a few examples of the many geoms
available in ggplot2. By selecting the appropriate geom for your data,
you can create informative and visually appealing plots that effectively
communicate the relationships within your dataset.
In scatter plots, data points can sometimes overlap, making it
difficult to discern individual observations. To address this issue, you
can apply a position adjustment, such as jitter, which
slightly moves the points in a random direction to reduce overlap.
# Create a scatter plot of the diamonds dataset with jittered points
jittered_plot <- ggplot(diamonds, aes(x = cut, y = price)) +
geom_point(position = "jitter", alpha = 0.5) +
theme_minimal()
jittered_plot
In this example, the points are jittered to reduce
overlap, making it easier to see the distribution of observations within
each cut category.
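For more control over the amount of jitter, the position can be given as a function call instead of a string. The sketch below is one possible setup: the toy data frame is invented, and the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data: many observations share the same category and value
overlap_df <- data.frame(
  group = rep(c("a", "b"), each = 20),
  score = rep(c(1, 2), times = 20)
)

jitter_demo <- ggplot(overlap_df, aes(x = group, y = score)) +
  geom_point(
    # Jitter horizontally only, and fix the seed so the plot is reproducible
    position = position_jitter(width = 0.2, height = 0, seed = 42),
    alpha = 0.5
  )

# Inspect the positions that will actually be drawn
drawn <- layer_data(jitter_demo)
range(as.numeric(drawn$x))
```

Setting height = 0 keeps the y values exact, which matters when the vertical position carries the measurement you want readers to compare.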
You can add multiple geoms to a single plot to create
more complex visualizations. For example, you can combine a scatter plot
with a smoothed line to show the overall trend, and add text labels to
annotate specific data points.
# Use the mtcars dataset as an example
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# Create a scatter plot of mpg (miles per gallon) vs. wt (weight) with a smoothed line
scatter_plot_smooth <- ggplot(mtcars, aes(x = wt, y = mpg, label = rownames(mtcars))) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, linetype = "dashed", color = "blue") +
theme_minimal()
scatter_plot_smooth
## `geom_smooth()` using formula 'y ~ x'
# Annotate specific data points with text labels
scatter_plot_annotate <- scatter_plot_smooth +
geom_text(data = subset(mtcars, mpg > 30 | wt > 5),
size = 3, hjust = -0.2, vjust = 0.5,
mapping = aes(label=rownames(subset(mtcars, mpg > 30 | wt > 5)))) +
annotate(geom = 'text', x = 5, y = 30, label="Text that I can add here", color="blue")
scatter_plot_annotate
## `geom_smooth()` using formula 'y ~ x'
In this example, we first created a scatter plot of mpg
vs. wt and added a smoothed line using
geom_smooth. Then, we annotated specific data points with
text labels using geom_text. By combining multiple
geoms, you can create more informative and visually
appealing plots.
Use the cell below to repeat the steps we have done so far. Generate
a plot that shows the diamonds data by cut and
price. Use geom_violin: what does it show, and
how do you interpret it? Add geom_point as well, make the points
very light (transparent), and use position jitter. Find more
options from data-wrangling-cheatsheet
Scaling variables allows you to transform their range, making it easier to visualize data that spans multiple orders of magnitude or to compare multiple variables with different units.
For numeric variables, you can use scale_x_continuous()
and scale_y_continuous() to adjust the scales of the x and
y axes. Let’s create a scatter plot of intgross
vs. rating and put the y axis on a log scale:
# Create a scatter plot of intgross vs. rating with a log-scaled y axis
scatter_plot_scaled <- ggplot(btd, aes(x = rating, y = intgross)) +
geom_point() +
scale_y_continuous(trans = "log10")
scatter_plot_scaled
## Warning: Removed 14 rows containing missing values (geom_point).
In this example, we used a log transformation to scale the
intgross variable. This can help reveal patterns in the
data that might not be apparent when using the original scales.
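To see what a log scale does to the values, here is a small sketch with invented numbers. It uses scale_y_log10(), the ggplot2 shorthand for a log10-transformed continuous y scale; the data frame and object names are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical values spanning several orders of magnitude
gross_df <- data.frame(
  rating = c(5, 6, 7, 8),
  gross = c(1e4, 1e6, 1e8, 1e9)
)

log_demo <- ggplot(gross_df, aes(x = rating, y = gross)) +
  geom_point() +
  scale_y_log10()

# After building the plot, y holds log10(gross),
# so equal ratios in the data become equal distances on the axis
built_y <- layer_data(log_demo)$y
built_y
```

On the original scale the three smallest values would be squeezed against the bottom of the plot; on the log scale each factor of ten takes up the same amount of space.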
For discrete variables, you can use scale_x_discrete()
and scale_y_discrete() to modify the order or appearance of
the categories. Let’s create a box plot of genre
vs. rating and reorder the genres by median
rating:
# Calculate the median rating for each genre
genre_median <- btd %>%
group_by(genre) %>%
summarize(median_rating = median(rating, na.rm = TRUE)) %>%
arrange(median_rating) %>%
filter(!is.na(genre)) %>%
filter(!genre%in%c('Family', 'Musical', 'Romance', 'Thriller'))
# Create a box plot of genre vs. rating with reordered categories
box_plot_scaled <- ggplot(btd, aes(x = genre, y = rating)) +
geom_boxplot() +
scale_x_discrete(limits = genre_median$genre) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
box_plot_scaled
## Warning: Removed 8 rows containing missing values (stat_boxplot).
In this example, we reordered the genre categories based
on the median rating. This can help highlight differences
in the distribution of rating across genres and make it
easier to compare them.
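The same reordering idea can be sketched on a built-in dataset. Here cyl in mtcars stands in for genre and mpg for rating; these substitutions and the object names are assumptions for this example.

```r
library(ggplot2)

# Median mpg per cylinder count, sorted from lowest to highest
median_mpg <- sort(tapply(mtcars$mpg, mtcars$cyl, median))
cyl_order <- names(median_mpg)

# limits sets both which categories appear and the order they appear in
ordered_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  scale_x_discrete(limits = cyl_order)
ordered_box
```

Because limits also acts as a filter, any category left out of the vector is dropped from the plot, which is how the genre example above excludes some genres.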
Scaling variables can enhance the readability of your plots and reveal hidden patterns in your data. By applying appropriate transformations to numeric and discrete variables, you can create more effective visualizations.
Color scaling can be used to visualize an additional variable in a
plot, adding an extra dimension of information. In this example, we will
create a scatter plot of budget vs. intgross,
with the color of the points representing the rating.
# Create a scatter plot of budget vs. intgross with color-scaled ratings
scatter_plot_color <- ggplot(btd, aes(x = budget, y = intgross, color = rating)) +
geom_point(alpha = 0.4) +
scale_x_continuous(labels = scales::comma, trans = "log10") +
scale_y_continuous(labels = scales::comma, trans = "log10") +
scale_color_continuous(low = "blue", high = "red") +
theme_minimal()
scatter_plot_color
## Warning: Removed 11 rows containing missing values (geom_point).
In this example, we used scale_color_continuous() to
adjust the color scale based on the rating variable. The
points are colored from blue (low rating) to red (high rating),
providing an extra layer of information on top of the relationship
between budget and intgross.
By applying color scaling, you can create richer visualizations that display more information and enhance the understanding of your data.
Factor variables are categorical variables that can take a limited number of distinct values. In ggplot2, you can refactor and reorder factor variables to enhance your visualizations.
Refactoring involves changing the levels of a factor variable. You
can use the forcats package (part of the tidyverse) to
refactor variables. The fct_recode() function can be used
to recode the levels of a factor.
# Load the required packages
# library(forcats)
# Create a new column with a refactored clean_test variable
btd_refactored <- btd %>%
mutate(clean_test_refactored = fct_recode(clean_test,
"pass" = "ok",
"fail" = "nowomen",
"fail" = "notalk",
"fail" = "men",
"N/A" = "dubious"))
head(btd_refactored)
## # A tibble: 6 × 23
## ...1 year imdb title test clean…¹ binary budget domgr…² intgr…³ code
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 0 2013 tt1711425 21 &a… nota… notalk FAIL 1.3 e7 2.57e7 4.22e7 2013…
## 2 1 2012 tt1343727 Dredd… ok-d… ok PASS 4.5 e7 1.34e7 4.09e7 2012…
## 3 2 2013 tt2024544 12 Ye… nota… notalk FAIL 2 e7 5.31e7 1.59e8 2013…
## 4 3 2013 tt1272878 2 Guns nota… notalk FAIL 6.1 e7 7.56e7 1.32e8 2013…
## 5 4 2013 tt0453562 42 men men FAIL 4 e7 9.50e7 9.50e7 2013…
## 6 5 2013 tt1335975 47 Ro… men men FAIL 2.25e8 3.84e7 1.46e8 2013…
## # … with 12 more variables: `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
## # `intgross_2013$` <dbl>, `period code` <dbl>, `decade code` <dbl>,
## # director <chr>, director_gender <chr>, genre <chr>, rating <dbl>,
## # country <chr>, language <chr>, clean_test_refactored <fct>, and abbreviated
## # variable names ¹clean_test, ²domgross, ³intgross
In this example, we refactored the clean_test variable
by recoding “ok” as “pass”, collapsing “nowomen”, “notalk”, and “men” into “fail”, and recoding “dubious” as “N/A”.
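The recoding can be tried on a small vector first. The values below are a made-up sample of the clean_test categories, and the object names are assumptions for illustration.

```r
library(forcats)

# Hypothetical sample of clean_test values
test_result <- factor(c("ok", "men", "notalk", "nowomen", "dubious", "ok"))

# New level on the left, old level on the right;
# repeating a new level merges several old levels into one
recoded <- fct_recode(
  test_result,
  "pass" = "ok",
  "fail" = "nowomen",
  "fail" = "notalk",
  "fail" = "men",
  "N/A" = "dubious"
)

table(recoded)
```

Checking the result with table() is a quick way to confirm that every old level ended up where you intended.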
Reordering involves changing the order of the levels of a factor
variable. You can use the fct_reorder() function to reorder
the levels of a factor based on the values of another variable.
# Create a new column with reordered genre based on the median intgross
btd_reordered <- btd_refactored %>%
mutate(genre_reordered = fct_reorder(genre, intgross, .fun = median, na.rm = TRUE, .desc = TRUE)) %>%
filter(!(is.na(genre) | genre%in%c('Family', 'Musical', 'Romance', 'Thriller')))
head(btd_reordered)
## # A tibble: 6 × 24
## ...1 year imdb title test clean…¹ binary budget domgr…² intgr…³ code
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 0 2013 tt1711425 21 &a… nota… notalk FAIL 1.3 e7 2.57e7 4.22e7 2013…
## 2 1 2012 tt1343727 Dredd… ok-d… ok PASS 4.5 e7 1.34e7 4.09e7 2012…
## 3 2 2013 tt2024544 12 Ye… nota… notalk FAIL 2 e7 5.31e7 1.59e8 2013…
## 4 3 2013 tt1272878 2 Guns nota… notalk FAIL 6.1 e7 7.56e7 1.32e8 2013…
## 5 4 2013 tt0453562 42 men men FAIL 4 e7 9.50e7 9.50e7 2013…
## 6 5 2013 tt1335975 47 Ro… men men FAIL 2.25e8 3.84e7 1.46e8 2013…
## # … with 13 more variables: `budget_2013$` <dbl>, `domgross_2013$` <dbl>,
## # `intgross_2013$` <dbl>, `period code` <dbl>, `decade code` <dbl>,
## # director <chr>, director_gender <chr>, genre <chr>, rating <dbl>,
## # country <chr>, language <chr>, clean_test_refactored <fct>,
## # genre_reordered <fct>, and abbreviated variable names ¹clean_test,
## # ²domgross, ³intgross
In this example, we reordered the levels of the genre
variable based on the median value of intgross for each
genre. This can be useful for creating plots where the categories are
ordered meaningfully.
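A small sketch makes the effect on the level order visible. The films data frame and its values are invented for illustration.

```r
library(forcats)

# Hypothetical data: two films per genre
films <- data.frame(
  genre = c("Action", "Action", "Drama", "Drama", "Comedy", "Comedy"),
  gross = c(50, 70, 300, 500, 120, 140)
)

# Default factor levels are alphabetical
default_levels <- levels(factor(films$genre))

# Reorder levels by the median gross of each genre, largest first
reordered <- fct_reorder(factor(films$genre), films$gross, .fun = median, .desc = TRUE)
reordered_levels <- levels(reordered)
reordered_levels
```

Only the level order changes; the values stored in the vector stay the same, so the data themselves are untouched.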
Now let’s create a bar plot of the reordered genres with the refactored clean_test variable:
# Create a bar plot of the refactored clean_test and reordered genre variables
bar_plot_reordered <- ggplot(btd_reordered, aes(x = genre_reordered, fill = clean_test_refactored)) +
geom_bar(position = "dodge") +
labs(x = "Genre (reordered)", fill = "Clean Test (refactored)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_plot_reordered
In this plot, we used the refactored clean_test variable
and the reordered genre variable to create a more
informative visualization. By refactoring and reordering factor
variables, you can enhance the readability and effectiveness of your
plots.
Adding titles, axis labels, limits, and customizing the theme can make your plots more informative and visually appealing. In this section, we will demonstrate how to enhance a plot using these features.
Let’s create a scatter plot of budget
vs. intgross and use the rating column for
color scaling:
scatter_plot_example <- ggplot(btd, aes(x = budget, y = intgross, color = rating)) +
geom_point(alpha = 0.7) +
scale_x_continuous(labels = scales::comma, trans = "log10") +
scale_y_continuous(labels = scales::comma, trans = "log10") +
scale_color_continuous(low = "blue", high = "red") +
theme_minimal()
scatter_plot_example
## Warning: Removed 11 rows containing missing values (geom_point).
You can add a main title, subtitle, and axis labels using the
labs() function:
scatter_plot_labeled <- scatter_plot_example +
labs(
title = "Budget vs. International Gross",
subtitle = "Colored by Rating",
x = "Budget (log scale)",
y = "International Gross (log scale)",
color = "Rating"
)
scatter_plot_labeled
## Warning: Removed 11 rows containing missing values (geom_point).
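To set the limits for the x and y axes, you can use the
xlim() and ylim() functions. Note that they replace the log10 scales
added earlier (see the messages below), and observations outside the
limits are removed from the plot: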
scatter_plot_limited <- scatter_plot_labeled +
xlim(100000000, 3e+8) +
ylim(100000000, 3e+8) +
geom_point(mapping = aes(size = rating), alpha=0.4)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
scatter_plot_limited
## Warning: Removed 1735 rows containing missing values (geom_point).
## Removed 1735 rows containing missing values (geom_point).
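If you want axis limits while keeping log10 axes, one option is to pass the limits to the scale itself rather than calling xlim() and ylim(). The sketch below uses invented data and arbitrary limits; scale_x_log10() and scale_y_log10() are the ggplot2 shorthands for log10-transformed continuous scales.

```r
library(ggplot2)

# Hypothetical budgets and grosses
spend_df <- data.frame(
  budget = c(1e5, 1e6, 1e7, 1e8, 2e8),
  gross = c(5e5, 2e6, 5e7, 1.5e8, 2.5e8)
)

# Limits are given in the original units; the axes stay logarithmic.
# Points outside the limits are still dropped from the plot.
limited_scale <- ggplot(spend_df, aes(x = budget, y = gross)) +
  geom_point() +
  scale_x_log10(limits = c(1e6, 3e8)) +
  scale_y_log10(limits = c(1e6, 3e8))

# x positions that remain after applying the limits (log10 units)
built <- layer_data(limited_scale)
kept_x <- built$x[!is.na(built$x)]
```

Because a plot can hold only one scale per aesthetic, keeping the transformation and the limits in the same scale call avoids one silently replacing the other.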
You can customize the appearance of your plot by modifying the theme.
The theme() function allows you to change various aspects
of the plot, such as text size, font, background colors, and grid
lines:
scatter_plot_custom_theme <- scatter_plot_limited +
theme(
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 14),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 10),
panel.background = element_rect(fill = "white"),
panel.grid.major = element_line(color = "gray", linetype = "dashed", size = 0.5),
panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.25)
)
scatter_plot_custom_theme
## Warning: Removed 1735 rows containing missing values (geom_point).
## Removed 1735 rows containing missing values (geom_point).
By adding titles, axis labels, limits, and customizing the theme, you can create more informative and visually appealing plots that effectively communicate your data’s story. See cheatsheet for additional themes.
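When several plots should share the same look, the theme settings can be stored in an object and reused. This is a sketch of one way to do it; the name workshop_theme and the particular settings are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical reusable theme: a complete theme plus a few overrides
workshop_theme <- theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    axis.title = element_text(size = 12, face = "bold"),
    legend.position = "bottom"
  )

# Add the stored theme to any plot like any other layer
themed_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point() +
  labs(title = "Fuel Economy vs. Weight") +
  workshop_theme
themed_plot
```

Defining the theme once keeps a set of figures consistent and means a later style change only has to be made in one place.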
ggsave() is a convenient function for saving your
ggplot2 plots in various file formats, such as PNG, PDF, SVG, or TIFF.
Using ggsave(), you can easily save high-quality versions
of your plots for use in reports, presentations, or publications.
Let’s create a scatter plot of budget
vs. intgross with the rating column for color
scaling:
scatter_plot_example <- ggplot(btd, aes(x = budget, y = intgross, color = rating)) +
geom_point(alpha = 0.7) +
scale_x_continuous(labels = scales::comma, trans = "log10") +
scale_y_continuous(labels = scales::comma, trans = "log10") +
scale_color_continuous(low = "blue", high = "red") +
labs(
title = "Budget vs. International Gross",
subtitle = "Colored by Rating",
x = "Budget (log scale)",
y = "International Gross (log scale)",
color = "Rating"
) +
theme_minimal()
scatter_plot_example
## Warning: Removed 11 rows containing missing values (geom_point).
To save this plot as a high-quality PNG image, you can use the
ggsave() function:
# Save the plot as a PNG file
ggsave(
filename = "scatter_plot_example.png",
plot = scatter_plot_example,
width = 8,
height = 5,
dpi = 300
)
## Warning: Removed 11 rows containing missing values (geom_point).
In this example, we saved the scatter_plot_example plot
as a PNG file with a width of 8 inches, a height of 5 inches, and a
resolution of 300 dots per inch (DPI). You can adjust the
width, height, and dpi parameters
to control the size and quality of the saved image. The
plot argument is optional: if it is omitted, ggsave() saves
the last plot that was displayed.
To save the plot in a different file format, you can change the file
extension in the filename parameter. For example, to save
the plot as a PDF, you can use:
# Save the plot as a PDF file
ggsave(
filename = "scatter_plot_example.pdf",
plot = scatter_plot_example,
width = 8,
height = 5
)
## Warning: Removed 11 rows containing missing values (geom_point).
By using ggsave(), you can easily export and save your
ggplot2 plots in a variety of file formats to share or include in your
documents.
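A quick way to confirm that a save worked is to check the file afterwards. The sketch below writes to a temporary file so that it does not clutter your working directory; the plot and the object names are assumptions for illustration.

```r
library(ggplot2)

demo_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# tempfile() gives a throwaway path; use a real file name in practice
out_file <- tempfile(fileext = ".pdf")
ggsave(filename = out_file, plot = demo_plot, width = 8, height = 5, units = "in")

# Confirm that the file was written and is not empty
file.exists(out_file)
file.size(out_file) > 0
```

Stating units explicitly avoids ambiguity about whether width and height are in inches, centimetres, or another unit.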
The patchwork package allows you to easily combine
multiple ggplot2 plots into a single layout. This can be useful for
comparing different visualizations side by side or creating more complex
visualizations.
First, let’s create some example plots using the mtcars
dataset:
# Load the required packages
library(ggplot2)
library(patchwork)
# Create a scatter plot of mpg vs. wt
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(aes(color = cyl)) +
labs(title = "Miles per Gallon vs. Weight", x = "Weight (1000 lbs)", y = "Miles per Gallon", color = "Cylinders") +
theme_minimal() +
theme(title = element_text(size = 12, face = "bold"), plot.subtitle = element_text(size = 10))
# Create a bar plot of the number of cars per number of cylinders
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar(aes(fill = factor(cyl))) +
labs(title = "Number of Cars per Number of Cylinders", x = "Number of Cylinders", y = "Count", fill = "Cylinders") +
theme_minimal() +
theme(plot.title = element_text(size = 8, face = "bold"))
# Create a box plot of mpg per number of cylinders
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(aes(fill = factor(cyl))) +
labs(title = "Miles per Gallon per Number of Cylinders", x = "Number of Cylinders", y = "Miles per Gallon", fill = "Cylinders") +
theme_minimal() +
theme(plot.title = element_text(size = 8, face = "bold"))
Now, let’s use the patchwork package to combine these
plots:
# Combine the plots using patchwork
combined_plot <- scatter_plot + bar_plot + box_plot + plot_layout(ncol = 1)
combined_plot
In this example, we combined the three plots into a single column
layout. You can adjust the layout by changing the ncol and
nrow parameters in the plot_layout()
function.
To add a global title, subtitle, caption, and panel tags, you can use
the plot_annotation() function; to collect the legends in one place,
use plot_layout(guides = "collect"):
# Collect legends and add global title, subtitle, and caption
combined_plot_annotated <- scatter_plot / (bar_plot + box_plot) +
plot_annotation(
title = "Exploring the mtcars Dataset",
subtitle = "Scatter plot, bar plot, and box plot",
caption = "Data source: mtcars",
theme = theme(plot.title = element_text(size = 13, face = "bold"), plot.subtitle = element_text(size = 11)),
tag_levels = 'A'
) +
plot_layout(guides = "collect")
combined_plot_annotated
By using the patchwork package, you can combine multiple
ggplot2 plots into a single layout, making it easier to compare and
present your visualizations.
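Layouts can also be written with the patchwork operators alone. The sketch below is a minimal example with three simple mtcars plots; the object names and the height ratio are assumptions for illustration.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
p3 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()

# | places plots side by side, / stacks them; parentheses group sub-layouts
side_by_side <- p1 | p2

# Top row twice as tall as the bottom row
nested_layout <- (p1 | p2) / p3 +
  plot_layout(heights = c(2, 1))
nested_layout
```

The combined object behaves like a regular plot, so it can be displayed or passed to ggsave() in the same way.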
Now it’s your time to shine. Let your imagination fly and explore the Bechdel test dataset. Come up with your own visualisation, using as many plots, colours, and themes as you see fit. Build it up gradually, which makes finding and identifying errors easier. You can even take your own data and explore it, but try to get to a publication-ready image (sans figure legend).
Find more options from data-wrangling-cheatsheet.